Online Retail

In this notebook, I will use the dataset to make a machine learning model which allows one to estimate if a given customer will buy something again from the online shop in the next quarter.

Problem Statements

As part of the main problem that will be tackled in this notebook, I will discuss in section named

  1. How many online customers are there in the dataset and what is their country of origin?

  2. What are the countries that are most represented in the dataset?

  3. Calculate the revenue that was made in each month and what is the percentage revenue based on the various countries?

Table of Contents

Importing Relevant Python Packages

Importing Relevant Data

Data Engineering

In this section, we will explore the given dataset to answer some general questions about the dataset. This will lead to cause us to introduce other features into the dataset which will help us to build a machine learning model that will help us to answer our main question.

Let's update the pandas dataframe df by renaming some of its columns.

We can check information about the dataframe with the info method.

From the output of the info method, we can see that the CustomerID and Description columns are likely to have missing values. Also we may want to change the data type of the objects in the InvoiceDate column to proper date objects with the to_datetime method.

Let's count the number of missing values in each column of the dataframe.

As predicted, the columns CustomerID and Description have $243007$ and $4382$ missing data respectively. We may want to drop these missing data from the dataframe using the dropna method.

Next, we update the dataframe df_data by converting the date field, InvoiceDate to datetime object. One can use the to_datetime method to achieve this.

From the above, we see that the online purchases made by customers was from 2009-12-01 to 2011-12-09.

In the subsequent cells, we will answer some questions we would like to know from the given dataset.

Exploration Question 1

How many online customers are there in the dataset and what is their country of origin?

Exploration Question 2

What are the countries that are most represented in the dataset?

The output above, shows the top $10$ countries of that purchase from the online retail shop with the United Kingdom being the country with the highest customers. They represent about $90.85\%$ of the entire online customers.

Although $13$ online customers indicated that they were living in two countries, this does not significantly affect the values percetage values recorded. For these $13$ online customers, it may be that the countries recorded for their last purchases is where they are currently living.

In the two next cells, we group countries with customer percentage value less than or equal to $0.25$ together and then plot a pie chart.

Exploration Question 3

Calculate the revenue that was made in each month and what is the percentage revenue based on countries?

From the graph figure above, we observe that the online retail shop made the highest sales in November 2010, followed by November 2011. Also, one can observe that, monthly sales rises after August.

Next, explore the percentage revenue generated by the retail shop based on the countries their customers reside.

From the output above, the top $5$ countries with respect to revenue generated are:

with the United Kingdom recording the highest in percentage $(82.93\%)$.

In the next two cells, we update cntry_revenue_df with a new

In the two next cells, we group countries with revenue percentage value less than or equal to $0.25$ together and then plot a pie chart.

Predicting Customer Purchase

The goal of this section is to come up with a make a model using the given dataframe df_data, to estimate if a given customer will buy something again from the online shop in the next quarter.

The dataframe is split into two.

Let's find the first purchase made by each customer in the next quarter.

Let's find the last purchase made by each customer in the dataframe ctm_bhvr_dt.

Let's merge the two dataframes ctm_last_purchase_bhvr_dt and ctm_1st_purchase_in_next_quarter.

Let's calculate the time difference in days between customer's last purchase in the dataframe ctm_last_purchase_bhvr_dt and the first purchase in the dataframe ctm_1st_purchase_in_next_quarter.

Update the dataframe ctm_dt by filling all missing values with $9999$.

Next, we will define some features and add them to the dataframe ctm_dt to build our machine learning model. We will use the Recency - Frequency - Monetary Value segmentation method. That is, we will put the customers into groups based on the following:

After we will apply K-means clustering to assign customers a score to each of the features.

Recency

Let's find the most recent purchase date of each customer and see how many days they have been inactive. Afterwards, we can apply K-means clustering to assign customers a recency score.

The mean Recency is approximately $205$ days whiles the median is $161$ days.

Next we will apply K-means clustering to assign a recency score. However, we need to know how many clusters in order to use the K-means algorithm. We will apply Elbow Method to determine how many clusters we will need. The Elbow Method simply tells the optimal cluster number for optimal inertia.

From the Figure above, $4$ seem to be the optimal one.

Let's build a $4$ clusters for recency and add it to dataframe, ctm_dt.

Observe from the above that, $3$ covers the most recent customers whereas $0$ has the most inactive customers.

Frequency

Next, we will find customers purchase behaviour based on the number of times they buy from the online retail shop. That is, the total number of orders by each customer.

As it was for the case of the Recency, higher frequency number means better customers.

Revenue

Overall Score

Finally we sum all the cluster features

The scoring above clearly shows us that customers with score $8$ are our best customers whereas those who score $3$ are the worst.

Let us create a copy of the dataframe ctm_dt and apply the method get_dummies to it so as to convert all categorical column Segment to indicator variables.

Since our goal is to estimate whether a customer will make a purchase in the next quarter, we will create a new column NextPurchaseDayRange with values as either $1$ or $0$ defined as follows:

Finally in this section, let's see the correlation between our features and label. We can achieve this by applying the corr method to the dataframe ctm_dt.

From the output above, we observe that Overall Score has the highest positive correlation of $0.97$ with RecencyCluster and Segment_Low-Value has the highest negative of -0.99 with Segment_Mid-Value.

We can get a good visualisation of the coefficient matrix below.

Building Machine Learning Models

From this result, we see that LogisticRegression is the best in terms of accuracy.

Let's see how we could improve on the accuracy of the existing model XGB by finding suitable parameters. We will check if the improved XGB Classifier model outperforms the LogisticRegression

The XGB Classifier model gives us the following:

As we can see, the improved XGB classifier model is accurate than the LogisticRegression model by $0.2$.

Let us go ahead and predict with these two models.

Let compute the confusion matrices of these two models with the user-defined function get_confusion_matrix defined below.

Let's get the confusion matrix for y_test and xgb_pred_y, i.e., for the improved XGB classifier model and y_test and log_reg_pred_y, i.e., for the LogisticRegression model.

Since we are concerned with accuracy of our model, we will choose the improved XGB classifier model over the LogisticRegression model.

Conclusion

In summary I observed the following: